```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '14px'}}}%%
timeline
    title 2020–2025 AI Milestones — ChatGPT Moments, Generative AI and Agentic AI
    2020 : GPT-3 — 175 billion parameters, few-shot learning
         : Vision Transformer (ViT) — applies transformers to image classification
         : AlphaFold2 wins CASP14 — solves protein folding
    2021 : DALL·E and CLIP — text-to-image generation and vision-language alignment
         : Codex and GitHub Copilot — AI-powered code generation
         : AlphaFold Protein Structure Database launched — expanded to 200M+ structures by 2022
    2022 : ChatGPT launched — 100M users in two months
         : InstructGPT and RLHF — aligning LLMs with human preferences
         : Chinchilla — optimal compute-data scaling laws
         : DALL·E 2 and Stable Diffusion — photorealistic image generation goes mainstream
         : Chain-of-thought prompting — teaching LLMs to reason step by step
    2023 : GPT-4 — multimodal reasoning across text and images
         : Segment Anything Model (SAM) — foundation model for computer vision
         : LLaMA and Llama 2 — Meta open-weights revolution
         : Midjourney V5 and DALL·E 3 — generative art reaches new heights
         : Microsoft 365 Copilot — AI embedded in enterprise productivity
    2024 : GPT-4o — omni-modal real-time AI assistant
         : AlphaFold3 — predicts structures of protein-DNA-RNA complexes
         : On-device AI — language models on smartphones and laptops
         : EU AI Act enters force — first comprehensive AI regulation
         : Nobel Prize in Chemistry for AlphaFold creators
    2025 : AI agents — planning, tool use, and autonomous multi-step actions
         : DeepSeek and open reasoning models challenge frontier labs
         : Text-to-video generation — Sora and the next creative frontier
         : AI governance frameworks expand worldwide
```
2020–2025 AI Milestones
ChatGPT Moments, Generative AI & Agentic AI — how large language models, diffusion models, and autonomous agents reshaped civilization

Introduction
The first half of the 2020s will be remembered as the era when artificial intelligence left the lab and entered everyday life. In just five years, AI progressed from a powerful but largely invisible technology to a cultural force that reshaped how billions of people work, create, learn, and communicate.
The period opened with a dramatic scaling experiment: in June 2020, OpenAI released GPT-3, a language model with 175 billion parameters that could write essays, code, and poetry from a simple text prompt. The era of few-shot learning had arrived — but the real earthquake came two years later. On November 30, 2022, OpenAI launched ChatGPT, a conversational interface to its large language models. It reached 100 million users in two months — the fastest consumer adoption in history — and ignited a generative AI revolution that swept through every industry on Earth.
Parallel breakthroughs in image generation transformed creativity itself. DALL·E (2021), Stable Diffusion (2022), and Midjourney became cultural phenomena, enabling anyone to generate photorealistic images from text descriptions. In science, AlphaFold2 solved the 50-year-old protein folding problem, predicting the structures of virtually all known proteins and winning its creators the 2024 Nobel Prize in Chemistry.
The march continued with GPT-4 (2023), which demonstrated multimodal reasoning across text and images; LLaMA and its successors, which democratized large language models through open weights; and the Segment Anything Model, which did for image segmentation what GPT-3 had done for language. By 2024, multimodal assistants could see, hear, and speak; on-device AI brought language models to smartphones and laptops; and governments worldwide began crafting AI governance frameworks, including the landmark EU AI Act.
By 2025, the frontier had shifted to agentic AI — systems that don’t just answer questions but plan, reason, use tools, and take autonomous actions. The AI agent era had begun, built atop everything that came before: transformers, scaling laws, RLHF alignment, and the hard-won lessons of deploying AI at planetary scale.
This article traces the defining milestones of 2020–2025 — from GPT-3’s few-shot learning revelation, through the ChatGPT moment that changed everything, to the rise of generative AI and the dawn of agentic systems.
Timeline of Key Milestones
GPT-3: The Scale Revolution (2020)
In June 2020, OpenAI released GPT-3 — a language model with 175 billion parameters trained on a massive corpus of internet text. GPT-3 demonstrated a stunning capability that its predecessors only hinted at: few-shot learning. Given just a few examples in a text prompt, it could translate languages, write code, compose poetry, generate business emails, and answer factual questions — all without any task-specific fine-tuning.
The jump from GPT-2’s 1.5 billion parameters to GPT-3’s 175 billion was not merely quantitative. It crossed a threshold where the model exhibited emergent capabilities — behaviors that appeared only at sufficient scale. GPT-3 could perform arithmetic, write SQL queries, and even generate functional code, despite never being explicitly trained for these tasks.
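Few-shot learning can be made concrete: a prompt is just worked examples concatenated ahead of a new query, and the model infers the task from the pattern. A minimal sketch (the translation pairs are invented for illustration):

```python
# A few-shot prompt in the GPT-3 style: worked examples followed by a new
# query. The model infers the task (English -> French) purely from the
# pattern; no fine-tuning is involved.
examples = [
    ("cheese", "fromage"),
    ("house", "maison"),
    ("to run", "courir"),
]

prompt = "Translate English to French.\n\n"
for english, french in examples:
    prompt += f"English: {english}\nFrench: {french}\n\n"
prompt += "English: good morning\nFrench:"  # the model completes this line

print(prompt)
```

The same mechanism covers translation, code generation, or Q&A simply by swapping the examples, which is what made the single pretrained model so versatile.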
| Aspect | Details |
|---|---|
| Released | June 2020 |
| Developer | OpenAI |
| Parameters | 175 billion (~117× larger than GPT-2’s 1.5 billion) |
| Training data | 570 GB of filtered text (Common Crawl, WebText2, Books, Wikipedia) |
| Training cost | Estimated $4.6 million in compute |
| Key capability | Few-shot learning — task performance from prompt examples alone |
| Access model | API-only (no public weights released) |
| Significance | Demonstrated that scale alone could unlock emergent capabilities |
“One of the things that was most surprising about GPT-3 is that it can do things it was never trained to do.” — Sam Altman, CEO of OpenAI
GPT-3 also ignited a debate about the nature of intelligence. Critics argued it was a sophisticated pattern matcher with no genuine understanding; proponents countered that its ability to generalize across tasks suggested something beyond mere memorization. Regardless of the philosophical disputes, GPT-3 proved that massive scale and simple next-token prediction could produce remarkably versatile systems — and set the stage for the ChatGPT moment that would come two years later.
```mermaid
graph LR
    A["GPT-1<br/>117M params<br/>(2018)"] --> B["GPT-2<br/>1.5B params<br/>(2019)"]
    B --> C["GPT-3<br/>175B params<br/>(2020)"]
    C --> D["InstructGPT<br/>RLHF-aligned<br/>(2022)"]
    D --> E["ChatGPT<br/>Conversational UI<br/>(Nov 2022)"]
    E --> F["GPT-4<br/>Multimodal<br/>(2023)"]
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#2980b9,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#f39c12,color:#fff,stroke:#333
    style F fill:#8e44ad,color:#fff,stroke:#333
```
Vision Transformer (ViT): Transformers Conquer Vision (2020)
In October 2020, researchers at Google Brain published “An Image is Worth 16×16 Words”, introducing the Vision Transformer (ViT). The paper demonstrated that a standard transformer architecture — originally designed for language — could achieve state-of-the-art image classification when applied directly to sequences of image patches, without any convolutional layers.
ViT divided each image into fixed-size patches (typically 16×16 pixels), flattened them into vectors, and processed the resulting sequence using a standard transformer encoder with self-attention. When pretrained on large datasets (JFT-300M), ViT outperformed the best convolutional networks on ImageNet while requiring substantially less computational budget to train.
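The patch-embedding step described above is simple enough to sketch in a few lines of NumPy. The image size here is a toy stand-in for ViT's usual 224×224 input, and the learned linear projection that maps each flattened patch to the model width is omitted:

```python
import numpy as np

# Split an image into non-overlapping 16x16 patches and flatten each into a
# vector -- the "tokens" a Vision Transformer feeds to a standard encoder.
def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch * patch * c)

image = np.random.rand(64, 64, 3)   # toy 64x64 RGB image
tokens = patchify(image)
print(tokens.shape)                 # (16, 768): 16 patches, 768 values each
```

After this step the sequence of patch tokens is processed exactly like a sequence of word tokens, which is the whole point of the paper's title.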
| Aspect | Details |
|---|---|
| Published | October 2020 |
| Authors | Alexey Dosovitskiy et al. (Google Brain) |
| Architecture | Standard transformer encoder applied to 16×16 image patches |
| Key result | Surpassed state-of-the-art CNNs on ImageNet when pretrained at scale |
| Significance | Unified vision and language under a single transformer architecture |
| Follow-ups | DeiT, Swin Transformer, BEiT, DINO — transformer-based vision became dominant |
“An image is worth 16×16 words” — the paper’s title captured the elegant simplicity of treating image patches as tokens.
ViT’s success had far-reaching consequences. It demonstrated that the transformer architecture was not specific to language but was a general-purpose sequence processor. This insight accelerated research into multimodal models — systems that could process text, images, audio, and video within a unified transformer framework — and paved the way for CLIP, DALL·E, and the multimodal assistants that emerged in 2023–2024.
AlphaFold2: Solving Protein Folding (2020–2021)
In November 2020, DeepMind’s AlphaFold2 achieved a breakthrough that scientists had pursued for 50 years: accurately predicting the three-dimensional structure of proteins from their amino acid sequences. At the CASP14 (Critical Assessment of protein Structure Prediction) competition, AlphaFold2 achieved a median GDT score of 92.4 out of 100 — a level of accuracy comparable to experimental techniques like X-ray crystallography — and made the best prediction for 88 out of 97 targets.
In July 2021, DeepMind and EMBL-EBI launched the AlphaFold Protein Structure Database, initially containing predictions for the human proteome and 20 model organisms. By July 2022, the database expanded to cover over 200 million protein structures — virtually every known protein across all life forms.
| Aspect | Details |
|---|---|
| CASP14 results | November 2020 |
| Developer | DeepMind (Google / Alphabet) |
| Architecture | Evoformer + structure module, end-to-end differentiable |
| CASP14 median GDT | 92.4 / 100 (comparable to experimental methods) |
| Database launched | July 2021 — expanded to 200M+ structures by July 2022 |
| Nobel Prize | 2024 Nobel Prize in Chemistry — Demis Hassabis and John Jumper |
| Significance | Solved a 50-year grand challenge in biology |
Nobel laureate Venki Ramakrishnan called AlphaFold2 “a stunning advance on the protein folding problem… It has occurred decades before many people in the field would have predicted.”
AlphaFold2’s impact on biology and medicine has been transformational. Researchers worldwide use it to understand disease mechanisms, design new drugs, and engineer novel proteins. In 2024, AlphaFold’s creators — Demis Hassabis and John Jumper — shared the Nobel Prize in Chemistry for their work on protein structure prediction, alongside David Baker for computational protein design. AlphaFold3 (2024) extended the approach to predict structures of protein complexes with DNA, RNA, and other molecules.
```mermaid
graph TD
    A["Amino Acid<br/>Sequence Input"] --> B["Multiple Sequence<br/>Alignment (MSA)"]
    B --> C["Evoformer<br/>(Attention-based)"]
    C --> D["Structure<br/>Prediction Module"]
    D --> E["3D Protein<br/>Structure Output"]
    E --> F["CASP14: 92.4 GDT<br/>(Comparable to<br/>X-ray crystallography)"]
    F --> G["200M+ Protein<br/>Structures in Database"]
    G --> H["2024 Nobel Prize<br/>in Chemistry"]
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#2980b9,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#f39c12,color:#fff,stroke:#333
    style F fill:#8e44ad,color:#fff,stroke:#333
    style G fill:#1a5276,color:#fff,stroke:#333
    style H fill:#e67e22,color:#fff,stroke:#333
```
DALL·E and CLIP: Vision Meets Language (2021)
In January 2021, OpenAI unveiled two groundbreaking systems that bridged the gap between vision and language: DALL·E, a model that generated images from text descriptions, and CLIP (Contrastive Language–Image Pre-training), which learned to connect images and text in a shared embedding space.
CLIP was trained on 400 million image-text pairs scraped from the internet, learning to match images with their corresponding text descriptions using contrastive learning. The result was a visual system that could classify images using natural language descriptions — including categories it had never seen during training (zero-shot classification).
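Zero-shot classification with CLIP reduces to a nearest-neighbor search in the shared embedding space: embed the image, embed one text prompt per candidate class, and pick the most similar. A minimal NumPy sketch, with random vectors standing in for the real image and text encoders:

```python
import numpy as np

# CLIP-style zero-shot classification. The random "embeddings" below are
# stand-ins for the outputs of CLIP's image and text towers; with the real
# model, the dog photo would land nearest the dog prompt.
rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

image_emb = normalize(rng.standard_normal(512))          # one image
class_prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = normalize(rng.standard_normal((3, 512)))     # one per class

# Cosine similarity is a dot product of unit vectors; argmax picks the class.
similarities = text_embs @ image_emb
predicted = class_prompts[int(np.argmax(similarities))]
print(predicted)
```

Because the classes are expressed as text, swapping in new categories requires nothing but new prompts, which is what "zero-shot" means here.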
DALL·E (a portmanteau of Salvador Dalí and WALL·E) was a 12-billion parameter version of GPT-3, modified to generate images from text prompts. It could produce images of “an armchair in the shape of an avocado” or “a snail made of harp strings” — demonstrating a compositional understanding of language and visual concepts that stunned researchers.
| Aspect | Details |
|---|---|
| Announced | January 2021 |
| Developer | OpenAI |
| CLIP | Contrastive learning on 400M image-text pairs; zero-shot image classification |
| DALL·E | 12B parameter GPT-3 variant; text-to-image generation |
| Key insight | Vision and language can be learned jointly in a shared representation space |
| Impact | Foundation for DALL·E 2, Stable Diffusion, Midjourney, and multimodal AI |
“CLIP effectively generalizes to virtually any visual classification task, merely by describing the classes in natural language.” — OpenAI Research Blog
CLIP’s vision-language alignment became the backbone of the generative image revolution that followed. Stable Diffusion used CLIP’s text encoder to steer its diffusion process; DALL·E 2 and DALL·E 3 built upon CLIP’s multimodal understanding. The insight that vision and language could be unified in a single representation space became one of the defining ideas of the era.
Codex and GitHub Copilot: AI Writes Code (2021)
In June 2021, GitHub and OpenAI launched GitHub Copilot as a technical preview — an AI pair-programming tool that suggested code completions directly inside the developer’s editor. In August 2021, OpenAI released Codex, the model powering Copilot: a GPT-3 descendant fine-tuned on publicly available code from GitHub that could translate natural language instructions into working code across dozens of programming languages.
Copilot was among the first AI systems to be adopted at massive scale in professional workflows. Within two years, it had millions of users and was generating a significant fraction of the code written by its adopters. Developers described it as transformative — not replacing programmers but dramatically accelerating their work.
| Aspect | Details |
|---|---|
| Codex released | August 2021 |
| GitHub Copilot launched | June 2021 (technical preview), general availability June 2022 |
| Developer | OpenAI (Codex), GitHub / Microsoft (Copilot) |
| Based on | GPT-3, fine-tuned on public GitHub code |
| Languages | Python, JavaScript, TypeScript, Go, Ruby, and dozens more |
| Significance | First widely adopted AI tool for professional software development |
| Impact | Millions of developers; GitHub’s controlled study reported tasks completed up to 55% faster |
“Copilot writes the boring code so I can focus on the interesting code.” — common developer sentiment, 2022
GitHub Copilot demonstrated that LLMs could serve as practical, everyday productivity tools — not just research curiosities. Its success paved the way for Microsoft 365 Copilot (2023), which brought the same paradigm to documents, spreadsheets, presentations, and email, embedding AI assistance into the core of enterprise productivity.
ChatGPT: The Moment Everything Changed (November 2022)
On November 30, 2022, OpenAI released ChatGPT — a conversational interface to its GPT-3.5 language model, fine-tuned using Reinforcement Learning from Human Feedback (RLHF) to be helpful, harmless, and honest. Within five days, it had one million users. Within two months, it reached 100 million monthly active users — making it the fastest-growing consumer application in history.
ChatGPT did not introduce fundamentally new AI capabilities. GPT-3 already existed; RLHF had been published in the InstructGPT paper earlier in 2022. What ChatGPT achieved was something more profound: it made AI accessible to everyone. The conversational interface — simple, free, and requiring no technical expertise — invited hundreds of millions of people to interact directly with a large language model for the first time.
| Aspect | Details |
|---|---|
| Launched | November 30, 2022 |
| Developer | OpenAI |
| Underlying model | GPT-3.5, fine-tuned with RLHF |
| 1 million users | Within 5 days |
| 100 million users | Within 2 months (fastest-growing consumer app ever) |
| ChatGPT Plus | Launched February 2023 ($20/month) |
| Key innovation | Conversational UI + RLHF alignment made LLMs accessible to everyone |
| Impact | Ignited the generative AI boom; triggered industry-wide AI arms race |
Kevin Roose of The New York Times called ChatGPT “the best artificial intelligence chatbot ever released to the general public.”
The ripple effects were immediate and seismic. Google declared a “code red” and rushed to launch its own chatbot, Bard (later Gemini). Microsoft invested $10 billion in OpenAI and integrated GPT-4 into Bing. Every major tech company pivoted to generative AI. Startups raised billions. Universities debated how to handle AI-generated assignments. And for the first time in history, hundreds of millions of ordinary people experienced the power — and the limitations — of conversational AI directly.
```mermaid
graph TD
    A["GPT-3<br/>(June 2020)"] --> B["InstructGPT + RLHF<br/>(March 2022)"]
    B --> C["ChatGPT<br/>(Nov 30, 2022)"]
    C --> D["100M Users<br/>in 2 Months"]
    D --> E["Google Bard<br/>Microsoft Bing Chat<br/>Industry AI Arms Race"]
    C --> F["ChatGPT Plus<br/>(Feb 2023)"]
    F --> G["GPT-4 Integration<br/>(March 2023)"]
    G --> H["Plugins, Browsing,<br/>Code Interpreter"]
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#f39c12,color:#fff,stroke:#333
    style E fill:#8e44ad,color:#fff,stroke:#333
    style F fill:#1a5276,color:#fff,stroke:#333
    style G fill:#2980b9,color:#fff,stroke:#333
    style H fill:#e67e22,color:#fff,stroke:#333
```
InstructGPT and RLHF: Aligning AI with Human Values (2022)
In March 2022, OpenAI published the InstructGPT paper, describing how they used Reinforcement Learning from Human Feedback (RLHF) to align language models with human intentions. The key insight was simple but powerful: instead of training only on next-token prediction, you could fine-tune a model to follow instructions, be truthful, and avoid harmful outputs — by using human preferences as a training signal.
The process involved three steps: (1) collect human demonstrations of ideal responses and fine-tune the model via supervised learning; (2) have human raters rank multiple model outputs to train a reward model; (3) use the reward model to fine-tune the language model via Proximal Policy Optimization (PPO).
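Step 2 can be sketched concretely: the reward model is trained so that the response humans preferred scores higher than the rejected one, typically with a pairwise (Bradley-Terry style) loss. A minimal NumPy illustration, with scalar placeholders standing in for reward-model outputs:

```python
import numpy as np

# Pairwise preference loss used to train RLHF reward models:
#   loss = -log sigmoid(r_chosen - r_rejected)
# Minimizing it pushes the reward for the human-preferred response above
# the reward for the rejected one.
def preference_loss(r_chosen: float, r_rejected: float) -> float:
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the margin between chosen and rejected grows ...
assert preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0)
# ... and is large when the model ranks the pair the wrong way round.
assert preference_loss(-1.0, 1.0) > preference_loss(1.0, -1.0)
print(round(preference_loss(1.0, 0.0), 4))
```

The trained reward model then supplies the training signal for the PPO stage in step 3, replacing per-example human labels with a learned proxy for human preference.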
| Aspect | Details |
|---|---|
| Published | March 2022 |
| Developer | OpenAI |
| Technique | RLHF — Reinforcement Learning from Human Feedback |
| Steps | Supervised fine-tuning → Reward model training → PPO optimization |
| Applied to | GPT-3 models (1.3B, 6B, and 175B); the approach later shaped GPT-3.5 and GPT-4 |
| Key result | A 1.3B InstructGPT was preferred over the 175B GPT-3 by human raters |
| Significance | Established RLHF as the standard method for aligning LLMs |
“Our 1.3 billion parameter InstructGPT model outputs are preferred to outputs of the 175 billion parameter GPT-3, despite having 100× fewer parameters.” — Ouyang et al. (2022)
RLHF became the default alignment technique across the industry. Google used it for Gemini, Anthropic refined it into RLAIF (RL from AI Feedback) and Constitutional AI, and Meta applied it to Llama. The InstructGPT paper established that alignment is not just about making models larger — it’s about making them better at following human intent — a lesson that underpins the entire modern AI stack.
Chinchilla and Scaling Laws: How to Train Efficiently (2022)
In March 2022, DeepMind published “Training Compute-Optimal Large Language Models”, introducing the Chinchilla model. The paper challenged the prevailing wisdom on scaling — demonstrating that most large language models were significantly undertrained relative to their size.
The key finding: for a given compute budget, the optimal strategy is to scale model size and training data in equal proportion. A 70-billion parameter model trained on 1.4 trillion tokens (Chinchilla) outperformed the 280-billion parameter Gopher trained on 300 billion tokens — despite having a quarter as many parameters.
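The arithmetic behind the finding is compact. Using the common approximation that training cost is about 6·N·D FLOPs (N parameters, D tokens) and the Chinchilla rule of thumb D ≈ 20·N, the compute-optimal sizes for a given budget follow directly; a sketch:

```python
import math

# Chinchilla's rule of thumb: training FLOPs C ~ 6*N*D (N params, D tokens)
# and compute-optimal training uses roughly D ~ 20*N. Substituting gives
# C ~ 120*N^2, so N = sqrt(C/120) and D = 20*N.
def compute_optimal(flops: float) -> tuple[float, float]:
    params = math.sqrt(flops / 120.0)
    tokens = 20.0 * params
    return params, tokens

# Sanity check against Chinchilla itself: 70B params on 1.4T tokens.
flops = 6 * 70e9 * 1.4e12   # ~5.9e23 FLOPs
n, d = compute_optimal(flops)
print(f"~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```

Run against Chinchilla's own budget, the formula recovers its 70B/1.4T configuration, which is exactly why Gopher (280B on only 300B tokens) was undertrained for its size.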
| Aspect | Details |
|---|---|
| Published | March 2022 |
| Developer | DeepMind |
| Model | Chinchilla (70B parameters, 1.4T training tokens) |
| Key finding | Models should be trained on ~20 tokens per parameter for compute-optimality |
| Result | Chinchilla (70B) outperformed Gopher (280B) on most benchmarks |
| Impact | Shifted industry focus from parameter count to data quality and quantity |
“For every doubling of model size, the number of training tokens should also be doubled.” — Hoffmann et al. (2022)
Chinchilla’s scaling laws reshaped how every lab trained large models. Meta’s LLaMA (2023) explicitly followed Chinchilla-optimal ratios, training a 65B parameter model on 1.4 trillion tokens. The lesson propagated industry-wide: more data, not just more parameters, was the path to better models.
DALL·E 2 and Stable Diffusion: The Generative Image Revolution (2022)
In April 2022, OpenAI released DALL·E 2, which used a diffusion-based approach (replacing the original DALL·E’s autoregressive method) to generate photorealistic images from text prompts at much higher resolution and fidelity. Then in August 2022, Stability AI released Stable Diffusion — an open-source latent diffusion model that could run on consumer GPUs.
Stable Diffusion democratized image generation. Unlike DALL·E 2, which was available only through an API, Stable Diffusion’s weights and code were publicly available. Anyone with a modest GPU could generate, modify, and fine-tune their own image generation models. Within weeks, a vibrant ecosystem of tools, extensions, and communities emerged.
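The forward (noising) process at the heart of these diffusion models is only a few lines: x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε, and generation learns to reverse it. A toy NumPy sketch with a linear beta schedule; note that Stable Diffusion applies this in a VAE's latent space rather than pixel space:

```python
import numpy as np

# Forward diffusion q(x_t | x_0): progressively mix a clean signal with
# Gaussian noise according to a schedule. alphas_bar[t] is the fraction of
# the original signal's variance that survives to step t.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(x0: np.ndarray, t: int) -> np.ndarray:
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.ones(8)                             # toy 1-D "image"
slightly_noisy = add_noise(x0, t=10)        # early step: barely perturbed
nearly_noise = add_noise(x0, t=T - 1)       # final step: almost pure noise
print(alphas_bar[10], alphas_bar[T - 1])
```

Training teaches a U-Net to predict ε from x_t (with CLIP's text embedding as conditioning); sampling then runs the chain backwards from pure noise to an image.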
| Aspect | Details |
|---|---|
| DALL·E 2 released | April 2022 |
| Stable Diffusion released | August 2022 |
| SD architecture | Latent Diffusion Model (VAE + U-Net + CLIP text encoder) |
| SD parameters | ~860M (U-Net) + 123M (text encoder) |
| SD training cost | ~$600,000 on 256 NVIDIA A100 GPUs |
| Key innovation | Diffusion in latent space — high quality at lower compute cost |
| Impact | Open-source image generation; ran on consumer hardware |
“Stable Diffusion marked the moment when AI image generation left the laboratory and entered the hands of millions.” — MIT Technology Review
The generative image revolution raised profound questions about copyright, consent, and the future of creative work. Artists protested that their styles were being replicated without permission. Legal battles ensued — Getty Images sued Stability AI; artists sued both Stability AI and Midjourney. Meanwhile, the technology continued to advance: Midjourney V5 (2023) produced images indistinguishable from photographs, and DALL·E 3 (October 2023) was integrated directly into ChatGPT.
```mermaid
graph LR
    A["DALL·E 1<br/>(Jan 2021)<br/>Autoregressive"] --> B["DALL·E 2<br/>(Apr 2022)<br/>Diffusion"]
    C["Latent Diffusion<br/>(LMU Munich, 2021)"] --> D["Stable Diffusion<br/>(Aug 2022)<br/>Open Source"]
    B --> E["DALL·E 3<br/>(Oct 2023)<br/>ChatGPT Integration"]
    D --> F["SD XL · SD 3<br/>Community Ecosystem"]
    G["Midjourney<br/>(2022–2023)"] --> H["Generative AI<br/>Cultural Phenomenon"]
    E --> H
    F --> H
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#2980b9,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#e74c3c,color:#fff,stroke:#333
    style E fill:#8e44ad,color:#fff,stroke:#333
    style F fill:#f39c12,color:#fff,stroke:#333
    style G fill:#1a5276,color:#fff,stroke:#333
    style H fill:#e67e22,color:#fff,stroke:#333
```
Chain-of-Thought Prompting: Teaching LLMs to Reason (2022)
In January 2022, Jason Wei and colleagues at Google Brain published a paper demonstrating that providing worked examples with explicit intermediate reasoning in the prompt could dramatically improve LLM performance on math, logic, and multi-step reasoning tasks. The technique became known as chain-of-thought (CoT) prompting; the zero-shot variant — simply appending “Let’s think step by step” — followed later that year (Kojima et al., 2022).
The insight was deceptively simple: LLMs trained on next-token prediction had learned to reason, but only when the reasoning process was made explicit in the output. By prompting the model to “show its work,” performance on arithmetic, commonsense reasoning, and symbolic manipulation improved dramatically — on the GSM8K math benchmark, accuracy roughly tripled.
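A chain-of-thought prompt is ordinary few-shot prompting with the reasoning written out. A sketch using the canonical tennis-ball example from the paper (the second question is invented for illustration):

```python
# Few-shot chain-of-thought prompt: the worked example spells out its
# intermediate reasoning, so the model imitates "showing its work" on the
# new question instead of jumping straight to an answer.
cot_example = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)
new_question = (
    "Q: A baker makes 4 trays of 12 rolls and sells 30 of them. "
    "How many rolls are left?\n"
    "A:"
)
prompt = cot_example + new_question
print(prompt)
```

Remove the intermediate sentences from the example answer and the model is far more likely to guess a final number directly, and to guess wrong; the reasoning trace is the entire trick.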
| Aspect | Details |
|---|---|
| Published | January 2022 (Wei et al.) |
| Developers | Google Brain |
| Technique | Include reasoning steps in few-shot examples or instruct “think step by step” |
| Key result | GSM8K math accuracy improved from ~18% to ~57% on PaLM 540B |
| Variants | Zero-shot CoT, self-consistency, tree-of-thought, chain-of-thought distillation |
| Impact | Foundation for reasoning models (o1, o3) and agentic workflows |
“Chain-of-thought prompting allows large language models to decompose multi-step problems into intermediate steps, significantly improving reasoning.” — Wei et al. (2022)
Chain-of-thought prompting was more than a prompt engineering trick. It revealed that reasoning is an emergent capability of scale — appearing only in sufficiently large models — and it laid the conceptual foundation for the reasoning models that emerged in 2024–2025, including OpenAI’s o1 and o3 series, which internalized chain-of-thought as a core inference mechanism.
GPT-4: Multimodal Intelligence (March 2023)
On March 14, 2023, OpenAI released GPT-4 — its most capable large language model at the time. GPT-4 was the first commercially deployed model to accept both text and image inputs (multimodal), producing text outputs that demonstrated markedly improved reasoning, factuality, and instruction-following compared to its predecessors.
GPT-4 scored in the 90th percentile on a simulated bar exam, in the 99th percentile on the GRE verbal section, and showed substantial improvements on coding benchmarks. Its multimodal capability allowed users to upload images and ask questions about them — a preview of the visual reasoning that would become central to AI assistants.
| Aspect | Details |
|---|---|
| Released | March 14, 2023 |
| Developer | OpenAI |
| Modalities | Text input + image input → text output |
| Bar exam performance | 90th percentile (vs. GPT-3.5’s 10th percentile) |
| GRE verbal | 99th percentile |
| Context window | 8K and 32K token variants |
| Follow-ups | GPT-4 Turbo (Nov 2023), GPT-4o (May 2024), GPT-4o mini (Jul 2024) |
| Significance | First commercially deployed multimodal LLM at scale |
“GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.” — OpenAI Technical Report
GPT-4 was quickly integrated into ChatGPT Plus, Bing Chat (now Microsoft Copilot), and hundreds of enterprise applications. It validated the bet that scaling plus RLHF alignment could produce models with genuine utility across law, medicine, coding, education, and creative work.
Segment Anything Model (SAM): Foundation Model for Vision (April 2023)
In April 2023, Meta AI released the Segment Anything Model (SAM) alongside the SA-1B dataset — the largest segmentation dataset ever created, containing over 1 billion masks on 11 million images. SAM could segment any object in any image, prompted by a point, a bounding box, or a rough mask — without being trained on that specific object category. (The paper also explored text prompts, though that capability was not part of the public release.)
SAM did for computer vision what GPT-3 did for language: it demonstrated that a single, large foundation model could generalize across virtually any visual segmentation task. It was a zero-shot, promptable vision model — a paradigm shift from task-specific models that required custom training for each new class of objects.
| Aspect | Details |
|---|---|
| Released | April 2023 |
| Developer | Meta AI Research (FAIR) |
| Dataset | SA-1B — 1.1 billion masks on 11 million images |
| Architecture | Image encoder (ViT-H) + prompt encoder + mask decoder |
| Key capability | Zero-shot segmentation of any object from points, boxes, or rough masks |
| Follow-up | SAM 2 (2024) extended to video segmentation |
| Significance | Foundation model paradigm applied to computer vision segmentation |
“SAM is to image segmentation what GPT-3 was to text generation — a demonstration that foundation models can generalize across an entire domain.” — Meta AI Blog
SAM accelerated research in autonomous driving, medical imaging, robotics, augmented reality, and video editing. Its release as open source ensured rapid adoption and inspired dozens of follow-up projects that extended the approach to 3D, video, and domain-specific applications.
LLaMA and Open-Source Language Models (2023)
In February 2023, Meta AI released LLaMA (Large Language Model Meta AI) — a family of language models ranging from 7B to 65B parameters. Unlike GPT-4, LLaMA’s weights were made available to the research community (and soon leaked publicly), igniting an open-source AI revolution.
LLaMA followed Chinchilla-optimal scaling: the 65B model was trained on 1.4 trillion tokens — far more data per parameter than previous models. The result was a 65B model that matched or exceeded the performance of much larger proprietary models on many benchmarks. In July 2023, Meta released Llama 2 with a commercial license, followed by Llama 3 in April 2024 — each iteration narrowing the gap with frontier proprietary models.
| Aspect | Details |
|---|---|
| LLaMA released | February 2023 |
| Llama 2 released | July 2023 (with commercial license) |
| Llama 3 released | April 2024 (8B and 70B), July 2024 (Llama 3.1, 405B) |
| Developer | Meta AI |
| LLaMA sizes | 7B, 13B, 33B, 65B parameters |
| Training data | 1.0T – 1.4T tokens (publicly available data) |
| Key innovation | Chinchilla-optimal training; open weights with commercial license |
| Impact | Spawned thousands of open-source derivatives and fine-tunes |
“Our mission is to open up access to AI so that more people and institutions can explore, research, and benefit from it.” — Meta AI
LLaMA’s release shattered the assumption that only closed-source labs could produce competitive language models. Within weeks, the open-source community produced Alpaca, Vicuna, WizardLM, and hundreds of fine-tuned variants. By 2024, open models regularly competed with proprietary ones on key benchmarks. In early 2025, DeepSeek-R1 from China demonstrated that open reasoning models could match frontier performance, further validating the open-source approach.
```mermaid
graph TD
    A["LLaMA<br/>(Feb 2023)<br/>Research License"] --> B["Llama 2<br/>(Jul 2023)<br/>Commercial License"]
    B --> C["Llama 3<br/>(2024)<br/>8B, 70B, 405B"]
    A --> D["Open-Source<br/>Explosion"]
    D --> E["Alpaca · Vicuna<br/>WizardLM"]
    D --> F["Mistral · Mixtral<br/>Qwen · DeepSeek"]
    C --> G["Narrowing Gap with<br/>Frontier Proprietary<br/>Models"]
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#2980b9,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#f39c12,color:#fff,stroke:#333
    style F fill:#8e44ad,color:#fff,stroke:#333
    style G fill:#1a5276,color:#fff,stroke:#333
```
Generative Art and Creative AI (2022–2023)
By 2023, generative AI had become a cultural phenomenon. Midjourney — a text-to-image service accessible via Discord — produced images so stunning that a Midjourney-generated artwork won a prize at the Colorado State Fair art competition in September 2022, igniting fierce debate about the nature of creativity and authorship.
DALL·E 3 (October 2023) was integrated directly into ChatGPT, allowing users to generate and refine images through natural conversation. Stable Diffusion XL (July 2023) introduced native 1024×1024 resolution and dramatically improved image quality. And the open-source community built an ecosystem of tools — ControlNet, LoRA fine-tuning, ComfyUI — that gave artists and developers unprecedented control over the generation process.
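LoRA, one of the community tools mentioned above, adapts a frozen pretrained weight matrix W by training only a small low-rank update BA, so the effective weight becomes W + (α/r)·BA. The sketch below illustrates the idea in NumPy; the shapes and hyperparameters are illustrative, not taken from any specific model.

```python
import numpy as np

d_out, d_in, r, alpha = 512, 512, 8, 16      # illustrative sizes; r is the rank

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable low-rank factor
B = np.zeros((d_out, r))                     # trainable, zero-initialized

def lora_forward(x):
    # Base path plus low-rank update; with B = 0 the output equals W @ x,
    # so training starts from the pretrained model's behavior.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
assert np.allclose(lora_forward(x), W @ x)   # identity at initialization

# Trainable parameter count: r*(d_in + d_out) instead of d_in*d_out.
full, lora = d_in * d_out, r * (d_in + d_out)
print(f"full fine-tune params: {full:,}  LoRA params: {lora:,}")
```

The parameter savings (here roughly 30× fewer trainable weights) are what made fine-tuning large diffusion and language models feasible on consumer GPUs.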
| Aspect | Details |
|---|---|
| Midjourney V5 | March 2023 — photorealistic quality |
| DALL·E 3 | October 2023 — integrated into ChatGPT |
| Stable Diffusion XL | July 2023 — 1024×1024, 3.5B parameters |
| Colorado controversy | September 2022 — Midjourney artwork wins art prize |
| Key tools | ControlNet, LoRA, DreamBooth, ComfyUI, AUTOMATIC1111 |
| Impact | Democratized visual creation; challenged traditional art markets |
“AI-generated art is the most disruptive thing to happen to visual culture since the invention of photography.” — Jason Allen, creator of the prize-winning Midjourney entry, 2022
Generative AI raised existential questions for creative professionals. Could AI replace artists, designers, and photographers? Was AI-generated content copyrightable? The legal and cultural debates intensified: the U.S. Copyright Office ruled that purely AI-generated images could not be copyrighted, while courts grappled with whether training on copyrighted data constituted fair use.
Microsoft 365 Copilot: AI in Enterprise (2023)
In March 2023, Microsoft announced Microsoft 365 Copilot, bringing GPT-4-powered AI assistance into Word, Excel, PowerPoint, Outlook, and Teams. This was a watershed moment: for the first time, large language models were embedded directly into the productivity tools used by hundreds of millions of knowledge workers worldwide.
Copilot could draft documents, summarize email threads, generate presentations from outlines, analyze spreadsheets with natural language queries, and take meeting notes in Teams. It demonstrated that LLMs could augment — rather than replace — human knowledge work.
| Aspect | Details |
|---|---|
| Announced | March 2023 |
| General availability | November 2023 |
| Powered by | GPT-4, Microsoft Graph (user context) |
| Applications | Word, Excel, PowerPoint, Outlook, Teams |
| Pricing | $30/user/month (enterprise) |
| Significance | LLMs embedded in enterprise productivity at planetary scale |
“Copilot is not just a better autocomplete — it’s a new way of working, where AI and human intelligence amplify each other.” — Satya Nadella, CEO of Microsoft
The launch of Microsoft 365 Copilot, alongside Google’s Duet AI for Workspace (later Gemini for Google Workspace), marked the beginning of the AI-augmented workplace. By 2024, AI assistance in documents, code, email, and data analysis was rapidly becoming the default expectation in enterprise environments.
Multimodal AI and GPT-4o (2024)
On May 13, 2024, OpenAI released GPT-4o (“o” for “omni”) — a model natively designed to process and generate text, audio, and images in a unified architecture. GPT-4o could engage in real-time voice conversations with human-like latency (~320 ms average response time), analyze images, and generate both text and audio outputs.
GPT-4o represented a fundamental shift from the chatbot paradigm to a multimodal assistant paradigm. It could understand tone, emotion, and context in voice conversations; describe and analyze visual scenes; and seamlessly switch between modalities — all at substantially faster speed and lower cost than GPT-4.
| Aspect | Details |
|---|---|
| Released | May 13, 2024 |
| Developer | OpenAI |
| Modalities | Text + audio + image (input and output) |
| Response latency | ~320 ms (comparable to human conversation) |
| Follow-ups | GPT-4o mini (Jul 2024, cost-optimized) |
| Key advance | Natively multimodal — not separate models stitched together |
| Significance | Shifted AI from text chatbot to omni-modal real-time assistant |
“The technology is moving so fast — GPT-4o is the kind of AI interaction we used to see only in science fiction movies.” — tech reviewer reaction, May 2024
Google’s Gemini models (2023–2024) pursued a parallel path toward natively multimodal architecture. By late 2024, the expectation for frontier AI systems was that they would be multimodal by default — understanding and generating across text, image, audio, and increasingly video.
On-Device AI and Model Efficiency (2024)
In 2024, the AI industry underwent a paradigm shift toward running capable language models directly on consumer devices — smartphones, laptops, and edge hardware — rather than relying solely on cloud APIs. Apple Intelligence brought on-device models to iPhones and Macs; Google embedded Gemini Nano into Pixel phones; and Qualcomm, Intel, and AMD shipped dedicated neural processing units (NPUs) optimized for transformer inference.
This shift was enabled by years of research into model compression techniques: quantization (reducing precision from 32-bit to 4-bit or lower), distillation (training smaller models to mimic larger ones), pruning, and efficient architectures like mixture-of-experts.
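To make the quantization idea concrete: symmetric 4-bit quantization maps each float weight to one of 16 integer levels around zero, storing only the small integer codes plus a shared scale. This is a toy sketch of the principle, not the actual GPTQ or AWQ algorithms, which add calibration and per-group scales.

```python
import numpy as np

def quantize_int4(w):
    # Symmetric int4 uses the range -7..7; one float scale covers the tensor.
    scale = np.max(np.abs(w)) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Inference reconstructs approximate float weights on the fly.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)

# Rounding error is bounded by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Going from 32-bit floats to 4-bit codes shrinks weight storage roughly 8×, which is what lets multi-billion-parameter models fit in a phone's memory budget.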
| Aspect | Details |
|---|---|
| Apple Intelligence | Announced June 2024, iOS 18 / macOS Sequoia |
| Gemini Nano | Deployed on Pixel devices for on-device chat and summarization |
| Hardware | NPUs in Qualcomm Snapdragon, Intel Core Ultra, Apple M-series |
| Techniques | Quantization (GPTQ, GGML, AWQ), distillation, pruning, LoRA adapters |
| Models | Phi-3 (Microsoft), Gemma (Google), Llama 3.2 (Meta) — designed for on-device |
| Significance | AI inference without cloud dependency; privacy-preserving, low-latency |
“On-device AI means your AI assistant works even without an internet connection — and your data never leaves your phone.” — Apple, WWDC 2024
On-device AI addressed critical concerns about privacy, latency, and cost. By running models locally, sensitive data never had to be transmitted to the cloud. And for developers, eliminating per-query API costs fundamentally changed the economics of AI-powered applications.
AI Governance and the EU AI Act (2024)
As AI capabilities accelerated, governments worldwide moved to establish regulatory frameworks. The most significant was the EU AI Act, which entered into force on August 1, 2024 — the world’s first comprehensive legal framework for artificial intelligence.
The EU AI Act classified AI systems into risk tiers: unacceptable (banned — e.g., social scoring, real-time remote biometric identification in public spaces), high-risk (subject to strict requirements), limited risk (transparency obligations), and minimal risk (largely unregulated). General-purpose AI models like GPT-4 and Llama fell under specific provisions requiring transparency and documentation.
| Aspect | Details |
|---|---|
| EU AI Act | Entered into force August 1, 2024 |
| U.S. Executive Order | October 30, 2023 (AI safety and security) |
| China AI regulations | Generative AI rules effective August 2023 |
| G7 Hiroshima AI Process | International voluntary code of conduct |
| Key principles | Risk-based classification, transparency, human oversight |
| Frontier model provisions | Safety testing, red-teaming, incident reporting |
| Significance | First comprehensive AI regulation; global template for governance |
“The AI Act is not about saying ‘no’ to AI — it’s about building trust so that AI can be adopted more widely.” — European Commission
In parallel, AI companies adopted voluntary safety commitments: red-teaming, model evaluations, watermarking of AI-generated content, and responsible disclosure. The 2024 Nobel Prizes in Chemistry (AlphaFold) and Physics (foundational neural-network research by Hopfield and Hinton) highlighted both AI’s transformative potential and the urgency of thoughtful governance.
AI Agents: The Next Frontier (2025)
By 2025, the focus of AI research and deployment shifted decisively toward agentic AI — systems that don’t just respond to queries but plan, reason, use tools, and execute multi-step tasks autonomously. AI agents could browse the web, write and execute code, manage files, call APIs, and chain together complex workflows with minimal human intervention.
OpenAI launched Operator (January 2025) for browser-based task automation, followed by Codex (May 2025) for autonomous software engineering, and a general-purpose ChatGPT Agent (July 2025). Google, Anthropic, and Microsoft deployed their own agent frameworks. The paradigm shifted from “chatbot you talk to” to “assistant that works for you.”
| Aspect | Details |
|---|---|
| OpenAI Operator | January 2025 — autonomous web browsing and task execution |
| OpenAI Codex agent | May 2025 — autonomous software engineering |
| ChatGPT Agent | July 2025 — general-purpose task agent |
| Anthropic Claude | Tool use, computer use, multi-step reasoning capabilities |
| Google Project Mariner | Agent framework for complex task delegation |
| Key capabilities | Planning, tool use, code execution, multi-step reasoning, memory |
| Significance | Shift from conversational AI to autonomous task execution |
“We’re moving from AI that answers questions to AI that actually does things for you.” — Dario Amodei, CEO of Anthropic
The agentic paradigm brought new challenges: reliability (agents could make mistakes that compound over multiple steps), safety (autonomous actions require guardrails), and trust (users needed confidence that agents would act within approved boundaries). But the potential was enormous: AI agents promised to automate entire workflows — from research and data analysis to coding, scheduling, and content creation — fundamentally reshaping knowledge work.
graph TD
A["User Intent<br/>(Natural Language)"] --> B["Planning Module<br/>(Decompose into Steps)"]
B --> C["Tool Selection<br/>(APIs, Code, Browser)"]
C --> D["Execution<br/>(Multi-step Actions)"]
D --> E["Observation<br/>& Reflection"]
E -->|"Iterate"| B
E --> F["Final Result<br/>Delivered to User"]
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#f39c12,color:#fff,stroke:#333
style E fill:#8e44ad,color:#fff,stroke:#333
style F fill:#1a5276,color:#fff,stroke:#333
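The plan-act-observe loop in the diagram above can be sketched in a few lines. The tools and the fixed step list here are hypothetical stand-ins; in a real agent, an LLM would choose each next step and decide when to stop.

```python
# Minimal plan-act-observe agent loop sketch (illustrative, not a real framework).
def search(query):
    return f"results for {query!r}"          # stand-in for a web-search tool

def calculate(expr):
    return eval(expr, {"__builtins__": {}})  # toy calculator for the sketch

TOOLS = {"search": search, "calculate": calculate}

def run_agent(goal, steps):
    observations = []
    for tool_name, arg in steps:             # "plan": fixed steps in this sketch
        result = TOOLS[tool_name](arg)       # act: invoke the selected tool
        observations.append((tool_name, result))  # observe and remember
    return observations                      # reflection/final answer would follow

log = run_agent("estimate revenue",
                [("search", "widget price"), ("calculate", "19 * 1000")])
assert log[1] == ("calculate", 19000)
```

The reliability and safety challenges mentioned above live inside this loop: each tool call can fail or misfire, and errors feed into the next step's context, which is why production agents add guardrails, approval gates, and retries around it.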
DeepSeek and the Open Reasoning Revolution (2025)
In January 2025, Chinese AI lab DeepSeek released DeepSeek-R1 — an open-weight reasoning model that matched or exceeded the performance of OpenAI’s o1 on mathematics, coding, and scientific reasoning benchmarks. DeepSeek-R1 was notable for its transparency: it explicitly showed its chain-of-thought reasoning process, and its weights were freely available.
DeepSeek’s breakthrough demonstrated that frontier AI capabilities were no longer the exclusive domain of a handful of well-funded Western labs. It also validated the viability of open reasoning models — LLMs that could perform extended multi-step reasoning with full weight availability for the research community.
| Aspect | Details |
|---|---|
| Released | January 2025 |
| Developer | DeepSeek (China) |
| Key models | DeepSeek-V3, DeepSeek-R1 (reasoning) |
| R1 performance | Competitive with OpenAI o1 on math, code, and science |
| Training efficiency | Reportedly trained at significantly lower cost than Western models |
| License | Open weights |
| Significance | Proved open models could match frontier reasoning capabilities |
“DeepSeek-R1 reminded the world that AI innovation is global — and that openness accelerates progress.” — AI researcher reaction, January 2025
DeepSeek’s success intensified the global race in AI development and prompted Western labs to accelerate their own reasoning model efforts. OpenAI responded with o3 (April 2025), and the competition between open and closed reasoning models became one of the defining dynamics of the AI landscape.
Text-to-Video and the Expanding Creative Frontier (2024–2025)
As image generation matured, the frontier moved to video. In February 2024, OpenAI previewed Sora — a diffusion transformer model that could generate realistic minute-long videos from text prompts. Sora produced coherent, cinematic footage with consistent characters, camera movements, and physical interactions that far surpassed previous video generation attempts.
By 2025, multiple labs had released video generation models: Google’s Veo, Runway’s Gen-3, and open-source alternatives. While none matched Hollywood production quality, they represented a paradigm shift — the ability to generate visual narratives from text descriptions, with implications for filmmaking, advertising, education, and entertainment.
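At its core, a diffusion model generates by starting from pure noise and iterating a learned denoising step. The toy sketch below shows the shape of that reverse process; `predict_noise` is a placeholder for the trained network (in Sora's case, a large transformer over spacetime patches of video latents), and all constants are illustrative.

```python
import numpy as np

T = 50                                   # number of diffusion steps (toy value)
betas = np.linspace(1e-4, 0.02, T)       # noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x, t):
    # Placeholder: a trained diffusion transformer would predict the noise here.
    return 0.1 * x

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))              # latent "frame", initialized as noise
for t in reversed(range(T)):
    eps = predict_noise(x, t)
    # DDPM-style mean update: remove the predicted noise component, rescale.
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:                            # inject fresh noise except at the last step
        x = x + np.sqrt(betas[t]) * rng.normal(size=x.shape)
```

Video generation runs this loop over 3D (space plus time) latents rather than a single image, which is where the coherence and compute-cost challenges noted in the table below come from.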
| Aspect | Details |
|---|---|
| Sora preview | February 2024 (OpenAI) |
| Sora public release | December 2024 |
| Competitors | Google Veo, Runway Gen-3, Pika, Kling |
| Capabilities | Minute-long coherent video from text prompts |
| Technical approach | Diffusion transformer trained on video data |
| Challenges | Physics consistency, long-form coherence, compute cost |
| Significance | Extended generative AI from static images to temporal narratives |
“The leap from image generation to video generation is like going from photography to cinema — it opens up entirely new forms of expression.”
Text-to-video, alongside advances in 3D generation and world models, pointed toward a future where AI could generate entire virtual environments, interactive experiences, and personalized media on demand.
The 2020–2025 Transformation at a Glance
The half-decade from 2020 to 2025 transformed AI from a powerful but specialized technology into a general-purpose tool reshaping civilization. The speed and breadth of change were without precedent:
| Dimension | 2019 State | 2025 State |
|---|---|---|
| Frontier language models | GPT-2 (1.5B parameters) | GPT-5, Gemini, Claude, Llama (100B–1T+ parameters) |
| Chat AI users | Virtually none | Hundreds of millions weekly |
| Image generation | Research demos | Billions of images generated; integrated into consumer apps |
| Video generation | Primitive | Minute-long coherent videos from text |
| Code generation | Auto-complete | Autonomous coding agents |
| AI in science | Promising early results | Nobel Prize–winning breakthroughs (AlphaFold) |
| AI regulation | Minimal | EU AI Act, executive orders, international frameworks |
| On-device AI | None | Full language models running on smartphones |
| AI agents | Concept/research | Deployed agent products (Operator, Copilot, Codex) |
| Industry investment | Billions annually | Hundreds of billions annually |
By 2025, AI was no longer a future technology. It was the present — woven into search, creativity, productivity, science, governance, and daily life. And the pace showed no signs of slowing.
Video: 2020–2025 AI Milestones — ChatGPT Moments, Generative AI & Agentic AI
Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀
References
- Brown, T. et al. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (2020). arXiv:2005.14165
- Dosovitskiy, A. et al. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” ICLR (2021). arXiv:2010.11929
- Jumper, J. et al. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596, 583–589 (2021).
- Radford, A. et al. “Learning Transferable Visual Models From Natural Language Supervision.” ICML (2021). arXiv:2103.00020
- Ouyang, L. et al. “Training Language Models to Follow Instructions with Human Feedback.” NeurIPS 35 (2022). arXiv:2203.02155
- Hoffmann, J. et al. “Training Compute-Optimal Large Language Models.” NeurIPS 35 (2022). arXiv:2203.15556
- Rombach, R. et al. “High-Resolution Image Synthesis with Latent Diffusion Models.” CVPR (2022). arXiv:2112.10752
- Wei, J. et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 35 (2022). arXiv:2201.11903
- OpenAI. “GPT-4 Technical Report.” arXiv:2303.08774 (2023).
- Kirillov, A. et al. “Segment Anything.” ICCV (2023). arXiv:2304.02643
- Touvron, H. et al. “LLaMA: Open and Efficient Foundation Language Models.” arXiv:2302.13971 (2023).
- Touvron, H. et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” arXiv:2307.09288 (2023).
- Abramson, J. et al. “Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3.” Nature 630, 493–500 (2024).
- Wikipedia. “ChatGPT.” en.wikipedia.org/wiki/ChatGPT
- Wikipedia. “AlphaFold.” en.wikipedia.org/wiki/AlphaFold
- Wikipedia. “Stable Diffusion.” en.wikipedia.org/wiki/Stable_Diffusion
Read More
- See the decade of deep learning that preceded the generative AI era — 2010s AI Milestones
- The infrastructure decade that enabled modern AI — 2000s AI Milestones
- The data revolution and statistical learning — 1990s AI Milestones
- From expert systems to the second AI winter — 1980s AI Milestones
- The first AI winter and the seeds of recovery — 1970s AI Milestones
- Where it all began — 1950s–1960s AI Milestones
- How transformers power modern language models — Pre-training LLMs from Scratch
- Modern methods for aligning LLMs — Post-Training LLMs for Human Alignment
- From prompts to context — Prompt Engineering vs Context Engineering
- Scaling inference for production — Scaling LLM Serving for Enterprise Production